Using Association Rules of the Online Retail Dataset
1 Load Data
We first want to load our datasets and prepare them for some simple association rules mining.
tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")
tnx_data_tbl %>% glimpse()## Rows: 1,021,424
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.
tnx_purchase_tbl <- tnx_data_tbl %>%
filter(
quantity > 0,
exclude == FALSE
) %>%
select(invoice_id, stock_code, customer_id, quantity, price, stock_value, description)
tnx_purchase_tbl %>% glimpse()## Rows: 994,628
## Columns: 7
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434"…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "2…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "130…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3,…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 3…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS…
We now write this data out as a CSV so arules can read it in and process it.
tnx_purchase_tbl %>% write_csv("data/tnx_purchase_tbl.csv")2 Basket Analysis with Association Rules
We now want to do some basic basket analysis using association rules, which tries to determine which items get bought together, similar to taking a graph approachin many ways.
basket_arules <- read.transactions(
file = "data/tnx_purchase_tbl.csv",
format = "single",
sep = ",",
header = TRUE,
cols = c("invoice_id", "stock_code")
)
basket_arules %>% glimpse()## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## ..@ itemInfo :'data.frame': 4965 obs. of 1 variable:
## .. ..$ labels: chr [1:4965] "10002" "10002R" "10080" "10109" ...
## ..@ itemsetInfo:'data.frame': 41410 obs. of 1 variable:
## .. ..$ transactionID: chr [1:41410] "489434" "489435" "489436" "489437" ...
Now that we have this data we can look at some basic plots much like we produced before. For example, we can look at the relative frequency of the different items.
itemFrequencyPlot(basket_arules, topN = 20)itemFrequencyPlot(basket_arules, topN = 20, type = "absolute")The stock codes do not mean a huge amount to us, so we also want to look at the description field for these items.
freq_codes <- itemFrequency(basket_arules) %>%
sort(decreasing = TRUE) %>%
head(20) %>%
names()
tnx_purchase_tbl %>%
select(stock_code, description) %>%
filter(stock_code %in% freq_codes) %>%
distinct() %>%
drop_na(description) %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
desc = str_c(description, collapse = " : ")
) %>%
arrange(stock_code) %>%
datatable()2.1 Basic Concepts
The basic ideas of association rule mining and basket analysis draws on basic ideas from probability theory.
We speak in terms of the itemset: that is, a collection of one or more items that co-occur in a transaction.
For example, suppose we have a list of transactions as follows:
| ID | Items |
|---|---|
| 1 | milk, bread |
| 2 | bread, butter |
| 3 | beer |
| 4 | milk, bread, butter |
| 5 | bread, butter |
Using the above set of transactions, and itemset may be “milk” or “bread, butter”.
The support of an itemset \(X\), \(\text{Supp}(X)\), is defined as the proportion of transactions in the dataset which contain the itemset.
In the above example:
\[ \text{Supp}(\text{\{milk, bread\}}) = \frac{2}{5} = 0.40. \]
A rule, \(X \Rightarrow Y\), between two itemsets \(X\) and \(Y\) is a directed relationship of the itemset \(X\) showing the presence of \(Y\). The rule is not symmetric: \(X \Rightarrow Y\) and \(Y \Rightarrow X\) are not the same.
The confidence for the rule \(X \implies Y\), \(\text{Conf}(X \Rightarrow Y)\) is defined by
\[ \text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}. \]
So, to calculate the confidence for a rule:
\[ \text{Supp}(\text{\{milk, bread\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{0.4} = 0.5. \]
To illustrate how rules are not symmetric:
\[ \text{Supp}(\text{\{butter\}} \Rightarrow \text{\{milk, bread\}}) = \frac{0.2}{0.6} = 0.33. \]
Finally, we want a measure of the strength of the relationship between the itemsets \(X\) and \(Y\). That is, measuring the effect of the presence of \(X\) on the presence of \(Y\). We measure this by defining the lift of a rule as
\[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X) \text{Supp}(Y)}. \]
Again, we repeat our calculations for our rule.
\[ \text{Lift}(\text{\{bread, milk\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{(0.4)(0.6)} = \frac{0.2}{0.24} = 0.8333 \]
Lift values greater than 1 implies the presence of \(X\) increases the probability of \(Y\) being present when compared to the unconditional probability.
Now that we have these metrics and concepts, we can turn our attention to trying to find rules in a given dataset, using these metrics to rank them.
Rather than using brute-force approaches to discovering these rules, we use a number of different algorithms to find associations within the dataset.
The two main algorithms for discovering some rules are the apriori and the
eclat algorithms.
2.2 Construct apriori Rules
We now want to construct the association rules using the apriori algorithm.
To do this, we need to set parameters such as the minimum support and the
minimum confidence level.
This gives us a set of association rules, along with the support and lift.
basket_apriori <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.8)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 207
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.38s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 5 6 done [0.20s].
## writing ... [423 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
basket_apriori_tbl <- basket_apriori %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_apriori_tbl %>% glimpse()## Rows: 423
## Columns: 6
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "{2…
## $ support <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.00560…
## $ confidence <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.92…
## $ coverage <dbl> 0.005433470, 0.005409321, 0.005312726, 0.005385173, 0.00601…
## $ lift <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.1064,…
## $ count <int> 213, 213, 209, 209, 232, 232, 213, 224, 224, 225, 225, 209,…
We now want to inspect this table using the ruleExplorer()
basket_apriori %>% ruleExplorer()To help visualise these rules, we can produce a basic scatterplot of the metrics.
ggplot(basket_apriori_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")2.3 Construct eclat Rules
An alternative method of constructing association rules is to use the eclat
algorithm. The code for doing this is slightly different, but gives us similar
outputs.
basket_eclat <- eclat(
basket_arules,
parameter = list(support = 0.005)
) %>%
ruleInduction(
basket_arules,
confidence = 0.8
)## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 207
##
## create itemset ...
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.38s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating sparse bit matrix ... [1398 row(s), 41410 column(s)] done [0.02s].
## writing ... [6766 set(s)] done [3.20s].
## Creating S4 object ... done [0.00s].
basket_eclat_tbl <- basket_eclat %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_eclat_tbl %>% glimpse()## Rows: 423
## Columns: 5
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "{2…
## $ support <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.00560…
## $ confidence <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.92…
## $ lift <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.1064,…
## $ itemset <int> 52, 52, 56, 56, 59, 59, 52, 54, 54, 53, 53, 56, 34, 34, 55,…
Once again, we inspect the data using ruleExplorer()
basket_eclat %>% ruleExplorer()2.4 Compare Algorithms
We now want to compare the outputs of both algorithms in terms of association rules and how they compare.
basket_ap_tbl <- basket_apriori_tbl %>%
select(rules, support, confidence, lift)
basket_ec_tbl <- basket_eclat_tbl %>%
select(rules, support, confidence, lift)
rules_comparison_tbl <- basket_ap_tbl %>%
full_join(basket_ec_tbl, by = "rules", suffix = c("_a", "_e"))
rules_comparison_tbl %>% glimpse()## Rows: 423
## Columns: 7
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "…
## $ support_a <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.005…
## $ confidence_a <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.…
## $ lift_a <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.106…
## $ support_e <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.005…
## $ confidence_e <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.…
## $ lift_e <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.106…
2.5 Reducing Minimum Confidence
While high confidence rules are useful, they are more likely to find rules that are “obvious” as the probabilities are such that co-occuring basket items will be noticed as being together - or possibly be natural complements: butter, milk and bread is a good example.
Instead, we are also interested in less obvious rules, and so we reduce our confidence threshold and see how many additional rules are discovered.
basket_lower_rules <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.4)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 207
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.40s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 5 6 done [0.21s].
## writing ... [4991 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
basket_lower_rules_tbl <- basket_lower_rules %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))ggplot(basket_lower_rules_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")3 Converting Rules to Graphs
We also have the ability to convert these rules to a graph representation,
where each node is either a stock_code or a rule, with the edges of the
graph representing that item being contained in the rule.
apriori_rules_igraph <- basket_apriori %>%
plot(
measure = "support",
method = "graph",
control = list(max = 1000)
) %>%
as("igraph")apriori_rules_igraph %>% glimpse()## List of 10
## $ :List of 1
## ..$ 218: 'igraph.vs' Named int [1:3] 205 206 207
## .. ..- attr(*, "names")= chr [1:3] "assoc72" "assoc73" "assoc74"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 219: 'igraph.vs' Named int [1:13] 178 199 206 223 225 227 294 295 296 298 ...
## .. ..- attr(*, "names")= chr [1:13] "assoc45" "assoc66" "assoc73" "assoc90" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 220: 'igraph.vs' Named int [1:4] 226 234 238 239
## .. ..- attr(*, "names")= chr [1:4] "assoc93" "assoc101" "assoc105" "assoc106"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 221: 'igraph.vs' Named int [1:13] 205 207 224 231 233 237 295 296 297 301 ...
## .. ..- attr(*, "names")= chr [1:13] "assoc72" "assoc74" "assoc91" "assoc98" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 249: 'igraph.vs' Named int [1:13] 333 334 335 336 337 338 339 340 341 342 ...
## .. ..- attr(*, "names")= chr [1:13] "assoc200" "assoc201" "assoc202" "assoc203" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 250: 'igraph.vs' Named int [1:14] 334 335 336 337 345 352 355 368 369 370 ...
## .. ..- attr(*, "names")= chr [1:14] "assoc201" "assoc202" "assoc203" "assoc204" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 251: 'igraph.vs' Named int [1:5] 355 356 357 358 359
## .. ..- attr(*, "names")= chr [1:5] "assoc222" "assoc223" "assoc224" "assoc225" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 256: 'igraph.vs' Named int [1:2] 259 260
## .. ..- attr(*, "names")= chr [1:2] "assoc126" "assoc127"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 257: 'igraph.vs' Named int [1:69] 269 360 361 379 380 385 388 389 390 405 ...
## .. ..- attr(*, "names")= chr [1:69] "assoc136" "assoc227" "assoc228" "assoc246" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## $ :List of 1
## ..$ 259: 'igraph.vs' Named int [1:92] 218 266 267 378 379 380 381 382 383 384 ...
## .. ..- attr(*, "names")= chr [1:92] "assoc85" "assoc133" "assoc134" "assoc245" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "8428c1f4-3630-4809-9a13-ad5aa06b219b"
## - attr(*, "class")= chr "igraph"
We should first visualise this graph, using the top 30 rules in the dataset, as measured by the support of the rule.
basket_apriori %>%
head(n = 30, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)3.1 Extract Connected Product Labels
First we want to look at the different disjoint components of the graph, and label them with an ID.
apriori_rules_tblgraph <- apriori_rules_igraph %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()We then want to create groups of common products that form a disjoint cluster within this graph.
product_groups_all_tbl <- apriori_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = component_id, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_all_tbl %>% glimpse()## Rows: 133
## Columns: 2
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723",…
For display purposes, we can show all the stock_id values in a list.
3.1.1 Cluster Larger Groups
Within the large disjoint cluster there are a large number of products so rather than treating this as a single group we instead may investigate using further graph clustering algorithms to create further groupings.
apriori_rules_large_tblgraph <- apriori_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = group_edge_betweenness()
) %>%
unmorph()Now that we have sub-divided this large subgraph, we repeat the process.
product_groups_largest_tbl <- apriori_rules_large_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = sub_id, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_largest_tbl %>% glimpse()## Rows: 61
## Columns: 2
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,…
## $ stock_code <chr> "20719", "20723", "20724", "20725", "20726", "20727",…
Finally, it is worth trying to use an interactive tool to investigate this
subgraph, we we can use visNetwork() to produce an interactive JS tool
apriori_rules_large_tblgraph %>%
toVisNetworkData(idToLabel = FALSE) %>%
visNetwork(
nodes = .$nodes %>% transmute(id, label, group = sub_id),
edges = .$edges
)3.2 Evaluating Product Groups
How do we go about assessing the validity of these product groups?
Note that this work is exploratory - in effect this is more sophisticated data exploration. Rather than use this model to make predictions - a job we will need to do at some point, we instead just want to assess how novel these grouping are.
To that end, it may be useful to check the co-occurrence of these products as a group - for each purchase made by a customer, what proportion of the group was featured in this data?
This question is worth exploring, so we should write some code to assess this.
Before we do this, we combine our two lists of product groups into a single table.
stock_groups_tbl <- list(
ALL = product_groups_all_tbl,
LRG = product_groups_largest_tbl
) %>%
bind_rows(.id = "type") %>%
mutate(
group_label = sprintf("%s_%02d", type, product_group_id)
) %>%
group_by(group_label) %>%
mutate(
group_size = n()
) %>%
ungroup() %>%
select(group_label, group_size, stock_code)
stock_groups_tbl %>% glimpse()## Rows: 194
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size <int> 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723", "207…
tnx_groups_tbl <- tnx_data_tbl %>%
select(invoice_id, invoice_date, stock_code) %>%
group_nest(invoice_id, .key = "invoice_data")
arules_groups_tbl <- stock_groups_tbl %>%
group_nest(group_label, group_size, .key = "stock_data")
group_props_tbl <- arules_groups_tbl %>%
filter(group_size > 1, group_size < 15) %>%
expand_grid(tnx_groups_tbl) %>%
mutate(
comb_data = future_map2(
invoice_data, stock_data,
inner_join,
by = "stock_code",
.options = furrr_options(globals = FALSE)
),
match_count = map_int(comb_data, nrow),
group_prop = match_count / group_size
) %>%
select(group_label, group_size, group_prop) %>%
filter(group_prop > 0)
group_props_tbl %>% glimpse()## Rows: 63,177
## Columns: 3
## $ group_label <chr> "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02"…
## $ group_size <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ group_prop <dbl> 0.2, 0.6, 0.3, 0.3, 0.1, 0.7, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2…
We now create a histogram of the proportions for each group, and this gives us a gauge of the ‘novelty’ of each of these groups.
plot_tbl <- group_props_tbl %>%
mutate(label = glue("{group_label} ({group_size})"))
ggplot(plot_tbl) +
geom_histogram(aes(x = group_prop), binwidth = 0.1) +
facet_wrap(vars(label), scales = "free_y") +
scale_y_continuous(labels = label_comma()) +
xlab("Proportion") +
ylab("Purchase Count") +
ggtitle("Facetted Histograms of Group Coverages by Product Grouping") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))3.2.1 Write Product Groups
As this may be useful for later analysis and for later modelling, we output these groupings for later use.
stock_groups_tbl %>% write_rds("data/stock_groups_tbl.rds")4 R Environment
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.4 (2021-02-15)
## os Ubuntu 20.04.2 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-04-26
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## arules * 1.6-7 2021-03-16 [1] RSPM (R 4.0.4)
## arulesViz * 1.4-0 2021-03-07 [1] RSPM (R 4.0.3)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bookdown 0.21 2020-10-13 [1] RSPM (R 4.0.2)
## broom 0.7.5 2021-02-19 [1] RSPM (R 4.0.3)
## bslib 0.2.4 2021-01-25 [1] RSPM (R 4.0.3)
## cachem 1.0.4 2021-02-13 [1] RSPM (R 4.0.3)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.0.3)
## cli 2.3.1 2021-02-23 [1] RSPM (R 4.0.3)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.0.4)
## colorspace 2.0-0 2020-11-11 [1] RSPM (R 4.0.3)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.0.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.0.3)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## dbplyr 2.1.0 2021-02-03 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr * 1.0.5 2021-03-05 [1] RSPM (R 4.0.3)
## DT * 0.17 2021-01-06 [1] RSPM (R 4.0.3)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.3)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.0.3)
## foreach 1.5.1 2020-10-15 [1] RSPM (R 4.0.3)
## fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.3)
## furrr * 0.2.2 2021-01-29 [1] RSPM (R 4.0.3)
## future * 1.21.0 2020-12-10 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.0.3)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## haven 2.3.1 2020-06-01 [1] RSPM (R 4.0.3)
## highr 0.8 2019-03-20 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## htmlwidgets 1.5.3 2020-12-10 [1] RSPM (R 4.0.3)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.3)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.3)
## iterators 1.0.13 2020-10-15 [1] RSPM (R 4.0.3)
## jquerylib 0.1.3 2020-12-17 [1] RSPM (R 4.0.3)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.31 2021-01-27 [1] RSPM (R 4.0.3)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.0.3)
## lubridate 1.7.10 2021-02-26 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## Matrix * 1.3-2 2021-01-06 [2] CRAN (R 4.0.4)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## parallelly 1.24.0 2021-03-14 [1] RSPM (R 4.0.3)
## pillar 1.5.1 2021-03-05 [1] RSPM (R 4.0.3)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr * 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.0.3)
## registry 0.5-1 2019-03-05 [1] RSPM (R 4.0.0)
## reprex 1.0.0 2021-01-27 [1] RSPM (R 4.0.3)
## rlang * 0.4.10 2020-12-30 [1] RSPM (R 4.0.3)
## rmarkdown 2.7 2021-02-19 [1] RSPM (R 4.0.3)
## rmdformats 1.0.1 2021-01-13 [1] RSPM (R 4.0.3)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## rvest 1.0.0 2021-03-09 [1] RSPM (R 4.0.3)
## sass 0.3.1 2021-01-24 [1] RSPM (R 4.0.3)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## seriation 1.2-9 2020-10-01 [1] RSPM (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.3)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## tibble * 3.1.0 2021-02-25 [1] RSPM (R 4.0.3)
## tidygraph * 1.2.0 2020-05-12 [1] RSPM (R 4.0.3)
## tidyr * 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.3)
## tidyverse * 1.3.0 2019-11-21 [1] RSPM (R 4.0.3)
## TSP 1.1-10 2020-04-17 [1] RSPM (R 4.0.0)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.7 2021-03-29 [1] RSPM (R 4.0.4)
## visNetwork 2.0.9 2019-12-06 [1] RSPM (R 4.0.3)
## withr 2.4.1 2021-01-26 [1] RSPM (R 4.0.3)
## xfun 0.22 2021-03-11 [1] RSPM (R 4.0.3)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.0.3)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library